Red Hat Enterprise Linux 7 Troubleshooting

Being Proactive, Part 1

Module Topics

  • Being Proactive

  • Monitoring: Centralized Logging

  • Monitoring: Hard Drive Failures

  • Baselining: Using AIDE

  • Baselining: Using sar

  • Network Monitoring

Being Proactive

Steps that you should take before a problem occurs:

  • Monitoring systems

  • Baselining systems

  • Managing multiple versions of configuration files

  • Writing a disaster recovery plan

Being Proactive

Support Contracts
  • Support contracts for critical systems are essential.

  • Most large software and hardware vendors offer a range of support options.

  • Before you purchase a support contract, check what coverage you will actually get.

  • If support contracts are not enough, you may also need to keep spare hardware on-site.

    • If possible, configure spare hardware with automatic failover to minimize downtime.

    • This is often called a "hot spare" or "hot standby."

    • You may also want to keep a spare on-site for cold swap components.

  • Track warranties on hardware, as doing so may help you obtain spares quickly and efficiently.

    • Replace disks that are out of warranty.

    • Avoid using disks that are out of warranty for critical data.

    • Have a replacement plan in place, including funding and migration plans.

    • Know what to do when a warranty expires.

Being Proactive

Documentation

Maintain documents that fully outline and identify the following for your organization:

  • Hardware

  • Software

  • Configuration settings for each component

Being Proactive

Documentation
  • Use man -k keyword (equivalent to apropos) to search man page names and descriptions:

[student@server1 ~]$ man -k passwd
checkPasswdAccess (3) - query the SELinux policy database in the kernel.
chpasswd (8)          - update passwords in batch mode
ckpasswd (8)          - nnrpd password authenticator
fgetpwent_r (3)       - get passwd file entry reentrantly
getpwent_r (3)        - get passwd file entry reentrantly
...
passwd (1)            - update user's authentication tokens
sslpasswd (1ssl)      - compute password hashes
passwd (5)            - password file
passwd.nntp (5)       - Passwords for connecting to remote NNTP servers
passwd2des (3)        - RFS password encryption
...

Being Proactive

Documentation
  • Most other documentation is found in the /usr/share/doc/ directory, in subdirectories named after the RPM package that provides them.

  • If it is not a man page, not an info page, and not part of the GNOME help utility, it is stored here.

  • Many applications have their documentation packaged in a separate RPM package, which may or may not be installed. In Red Hat Enterprise Linux 7, these packages are often found in the Optional tree.

  • To locate the documentation supplied with an RPM package:

    • Use rpm -qd package to list all files flagged %doc.

    • Use rpm -qc package to list all configuration files distributed in the package.
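
    • For example, to see which documentation and configuration files ship with a given package (rsyslog is used here purely as an illustration; any installed package works):

      [root@server1 ~]# rpm -qd rsyslog
      ... Output omitted ...
      [root@server1 ~]# rpm -qc rsyslog
      ... Output omitted ...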

References
  • man(1) and rpm(8) man pages

  • /usr/share/doc/packagename/

Monitoring: Centralized Logging

  • Information gathering is one of the most important phases of troubleshooting.

  • Log files, kernel output, and device output can all help you diagnose your system more quickly.

  • Knowing how to order and search output is essential in troubleshooting.

  • Commands such as grep, uniq, sort, and less are fundamental to finding errors and identifying problems.

  • If possible, compare logs and output with a similar healthy system to locate relevant error messages.

  • Once you locate the errors, you can fix the problem, and then test.
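
  • For example, one quick way to page through error messages and see which repeat most often (a rough sketch; adjust the pattern and log file to your environment, and note that timestamps keep otherwise identical lines from fully collapsing):

    [root@server1 ~]# grep -i error /var/log/messages | sort | uniq -c | sort -rn | less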

Monitoring: Centralized Logging

  • Good logging practices are prerequisites to effective troubleshooting.

  • Ensure that syslog is running and configured to log information from important services on all systems.

  • Increase the loglevel to aid with troubleshooting. (For example, from info to debug.)

  • Ensure that important messages are forwarded to a central log server, perhaps one that proactively watches events and notifies you of pending failures.

  • Red Hat Enterprise Linux 7 uses rsyslog for event logging, an enhanced syslog daemon providing support for both UDP and TCP transport, failover destinations, and queued operations.

    • /etc/rsyslog.conf contains numerous comments.

    • See /usr/share/doc/rsyslog-*/ for more information.
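
    • For example, a selector in /etc/rsyslog.conf that makes sure debug-priority messages from the daemon facility are kept in their own file (the destination path is a placeholder):

      # capture daemon facility messages at debug priority and above
      daemon.debug                                /var/log/daemon-debug.log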

Monitoring: Centralized Logging

Configuring a Server to Accept Remote Log Messages Using UDP
  1. Uncomment the following lines in /etc/rsyslog.conf:

    $ModLoad imudp
    $UDPServerRun 514
  2. Restart the service:

    [root@server1 ~]# systemctl restart rsyslog
  3. Open the host firewall for inbound traffic on port 514/UDP (and 514/TCP if you also enable the TCP listener).
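
    For example, with firewalld (the default firewall in Red Hat Enterprise Linux 7), one way to open the UDP port:

    [root@server1 ~]# firewall-cmd --permanent --add-port=514/udp
    [root@server1 ~]# firewall-cmd --reload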

Monitoring: Centralized Logging

Forwarding Messages via UDP to a Central Log Server
  1. Decide on the types of messages (facility and priority) and the name or IP address of the central log server.

  2. Add a line similar to the following to /etc/rsyslog.conf (a single @ forwards over UDP; use @@ to forward over TCP):

    *.info      @server1
  3. Restart the service:

    [root@desktop1 ~]# systemctl restart rsyslog
  4. Test the forwarding rule with the logger command:

    [root@desktop1 ~]# logger "Hello from desktop1"
    [root@desktop1 ~]# tail /var/log/messages
    Jan 18 14:24:37 desktop1 root: Hello from desktop1
    [root@server1 ~]# tail /var/log/messages
    Jan 18 14:24:37 desktop1 root: Hello from desktop1

References
  • rsyslog.conf(5) and rsyslogd(8) man pages

  • /usr/share/doc/rsyslog-*/

Monitoring: Hard Drive Failures

  • Hard drives die. It is not a question of if a drive will die but rather when.

  • If you know that a drive is dying, you can plan for its replacement instead of responding to an emergency.

  • SMART = Self-Monitoring, Analysis and Reporting Technology

    • SMART is built into almost all modern hard drives.

    • On Red Hat Enterprise Linux systems, the smartd daemon polls all of the hard drives every 30 minutes.

    • If smartd sees that a drive is dying, it issues a message to /var/log/messages and sends an email message to the root user on the local system.

    • You can specify an alternate, centralized email address in /etc/smartmontools/smartd.conf.
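
    • A minimal sketch of such an entry in /etc/smartmontools/smartd.conf (the mail address is a placeholder):

      # monitor all drives, check overall health, and mail warnings to a central address
      DEVICESCAN -H -m storage-alerts@example.com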

Monitoring: Hard Drive Failures

  • Another method of talking to a SMART-enabled drive is with the smartctl tool.

  • One method of using smartctl is to ask for only the overall health status:

    [root@server1 ~]# smartctl -H /dev/sda
    smartctl 6.2 2013-07-26 r3841 [x86_64-linux-3.10.0-123.el7.x86_64] (local build)
    Copyright (C) 2002-13, Bruce Allen, Christian Franke, www.smartmontools.org
    
    === START OF READ SMART DATA SECTION ===
    SMART Health Status: OK
  • For more detailed information, query all the individual counters: smartctl -a /dev/sda. The column you are interested in is RAW_VALUE.

  • To tell the drive to perform a test immediately, use smartctl -t testtype /dev/sda, where testtype is either offline, long, or short.

  • To view the results of a self-test (long or short), run smartctl -l selftest /dev/sda.

  • To get the output of the offline test or the errors from any other test, run smartctl -l error /dev/sda.
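
  • For example, to start a short self-test and then review its results (the device name is an assumption; substitute your own drive):

    [root@server1 ~]# smartctl -t short /dev/sda
    ... Output omitted ...
    [root@server1 ~]# smartctl -l selftest /dev/sda
    ... Output omitted ...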

Reference
  • smartd(8), smartd.conf(5), and smartctl(8) man pages

Baselining: Using AIDE

Good baseline monitoring of systems is extremely helpful when troubleshooting.

  • Compare current behavior against the baseline when a system appears to be behaving erratically.

  • Report when a system is operating outside of specified parameters.

  • Tighten security.

  • Build trends for your systems and networks over time.

  • Use trends to spot events outside of the norm.

  • Deciding what to monitor depends on the work that a system does.

    • For database servers or file servers, disk space, service availability, and load might be important.

    • For a desktop system, you might just check to see that it is running.

  • Long-term monitoring can be used to:

    • Measure growth of system load over time

    • Predict when a new server or file store is required

    • Measure how improvements affect service availability, workflow, and numerous other areas

Baselining: Using AIDE

  • AIDE = Advanced Intrusion Detection Environment

  • AIDE is a tool to check the integrity of files on the system.

  • While the system is in a known good state, AIDE scans the system and collects information about installed files, including:

    • Checksums

    • Permissions

    • Other characteristics

  • Information is placed in a database file which can be stored offline.

  • Use AIDE to compare the state of the system against the stored database and check for any changes.

Baselining: Using AIDE

Steps to Deploy AIDE

The following is an example of deploying AIDE on server1.

  1. Install the aide package.

    [root@server1 ~]# yum install -y aide
    ... Output omitted ...
  2. Customize /etc/aide.conf to your liking.

    Example
    @@define DBDIR /var/lib/aide (1)
    @@define LOGDIR /var/log/aide
    
    database=file:@@{DBDIR}/aide.db.gz (2)
    database_out=file:@@{DBDIR}/aide.db.new.gz (3)
    gzip_dbout=yes
    report_url=file:@@{LOGDIR}/aide.log (4)
    report_url=stdout
    
    # R is short for p+i+n+u+g+s+m+c+acl+selinux+xattrs+md5
    NORMAL = R+rmd160+sha256 (5)
    PERMS = p+i+u+g+acl+selinux
    
    / NORMAL (6)
    !/etc/.*~
    /root/..* PERMS
    (1) Defines macros that can be used in /etc/aide.conf.
    (2) Configuration directive defining the location of the AIDE database. Note that this example uses a macro defined above.
    (3) Configuration directive defining the location in which aide --init will save a newly created database file.
    (4) Where the results of aide --check will be reported. Note that multiple locations are allowed.
    (5) Group definition line. Files selected by AIDE in group NORMAL will store information about their regular permissions, inode, number of links, user and group, size, mtime and ctime, POSIX ACLs, SELinux context, extended attributes, MD5 checksum, RMD160 checksum, and SHA256 checksum.
    (6) Selection lines. The first adds all files under / to be checked in group NORMAL; the second exempts all files in /etc that end in ~ from being checked; the third specifies that all files under /root whose names start with a period (.) should be checked in group PERMS only. Note that these use regular expression syntax.
  3. Run /usr/sbin/aide --init to build the initial database. This can take a while as it creates a gzipped database called /var/lib/aide/aide.db.new.gz.

    [root@server1 ~]# aide --init
    
    AIDE, version 0.15.1
    
    ### AIDE database at /var/lib/aide/aide.db.new.gz initialized.
  4. Store /etc/aide.conf, /usr/sbin/aide and /var/lib/aide/aide.db.new.gz in a secure location (not on this same system!). Alternatively, extract a signature of these files so they can be verified in the future.

  5. Copy /var/lib/aide/aide.db.new.gz to /var/lib/aide/aide.db.gz (the expected name).

    [root@server1 ~]# cd /var/lib/aide
    [root@server1 aide]# cp aide.db.new.gz aide.db.gz
    [root@server1 aide]# cd

Baselining: Using AIDE

Verifying System Integrity with AIDE

This next example demonstrates testing file integrity using aide.

  1. Modify a file on your system to be different.

    [root@server1 ~]# echo shiny new >> /bin/tcsh
  2. Run /usr/sbin/aide --check to check your system for inconsistencies.

    [root@server1 ~]# aide --check
    AIDE 0.15.1 found differences between database and filesystem!!
    Start timestamp: 2014-12-15 08:22:04
    
    Summary:
      Total number of files:        107530
      Added files:                  9
      Removed files:                0
      Changed files:                10
    
    
    ---------------------------------------------------
    Added files:
    ---------------------------------------------------
    
    ... Output omitted ...
    
    ---------------------------------------------------
    Changed files:
    ---------------------------------------------------
    
    changed: /usr/bin/tcsh
    ... Output omitted ...

    Results are displayed on standard output and in /var/log/aide/aide.log by default.

    If you know about these changes, you can run aide --update to update your database and store it in a secure location again.
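
    A brief sketch of that update step, assuming the database locations from the example configuration above:

    [root@server1 ~]# aide --update
    ... Output omitted ...
    [root@server1 ~]# cp /var/lib/aide/aide.db.new.gz /var/lib/aide/aide.db.gz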

References
  • aide(1) and aide.conf(5) man pages

  • AIDE Quick Start: /usr/share/doc/aide-*/README.quickstart

  • AIDE Manual: /usr/share/doc/aide-*/manual.html

Baselining: Using sar

  • sar = System Activity Reporter

  • sar is provided by the sysstat package and does the following:

    • Collects information about system activity from the operating system at a particular point in time.

    • Takes a sample of data over a selected time period, either once or on some repeating schedule.

    • Collected information includes memory usage, disk I/O, network activity, and so on.

  • There are two modes in which sar operates:

    • When sysstat is installed, a cron job is set up that takes a one second sample of system activity every ten minutes and saves it to a file.

      • Use the sar command to read this information.

    • Run sar from the command line to collect specific data, averaged over a certain period of time in seconds, a specified number of times.
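
  • A brief illustration of both modes (the name of the saved daily file depends on the day of the month):

    [root@server1 ~]# sar -u 2 5
    ... Output omitted ...
    [root@server1 ~]# sar -u -f /var/log/sa/sa15
    ... Output omitted ...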

Baselining: Using sar

Deploying the sar Command
  • Install the sysstat package. This package provides cron scripts (/etc/cron.d/sysstat) that will gather data automatically.

    [root@server1 ~]# cat /etc/cron.d/sysstat
    
    # Run system activity accounting tool every 10 minutes
    */10 * * * * root /usr/lib64/sa/sa1 1 1
    # 0 * * * * root /usr/lib64/sa/sa1 600 6 &
    # Generate a daily summary of process accounting at 23:53
    53 23 * * * root /usr/lib64/sa/sa2 -A
  • The first column of sar output is the time of the recorded statistics.

    • To ensure this column is always in a format you can parse, prefix your sar commands with LANG=C to get a unified time format.

    • To make this the default for your session, use export LANG=C.

  • Example sar commands:

    • sar -A displays all information collected today.

    • sar -u 2 5 displays five samples of system CPU usage spaced 2 seconds apart.

    • sar -r displays memory statistics.

    • sar -S displays swap space utilization statistics.

    • sar -b displays I/O statistics.

  • To extract specific columns from the output, add awk parsing:

    [root@server1 ~]# export LANG=C
    [root@server1 ~]# sar -r | tail -n+5 | awk '{print $1,$4,$8}'
    10:20:01 %memused %swpused
    10:30:01 92.28 0.05
    10:40:01 92.28 0.05
    Average: 92.28 0.05
Reference
  • sar(1), sa1(8), sa2(8), and sadc(8) man pages

Network Monitoring

  • Network monitoring measures network activity and looks for slow or failing servers, routers, switches, or other devices.

  • Active and passive monitoring techniques may involve agents residing on network equipment that either notify a network management system or are polled by it.

  • Many enterprises use network monitoring/management systems and services from CA, HP, IBM, and other vendors.

  • Nagios is a free open source monitoring tool.

    • Provided via the EPEL (Extra Packages for Enterprise Linux) repository from the Fedora project

    • Not supported by Red Hat

    • Modular system consisting of a core nagios package with additional functionality provided by plug-ins

    • Plug-ins can run on local machines to provide information not readily available via the network

    • Flexible configuration allows definitions of time periods, admin groups, system groups, and custom command sets

    • A web-based interface on the main Nagios server allows configuring tests and settings for Nagios and for the hosts it monitors
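
    • As a rough sketch, once the EPEL repository has been enabled on a host, the core packages can be installed with yum (assuming the nagios and nagios-plugins-all package names are available there):

      [root@server1 ~]# yum install -y nagios nagios-plugins-all
      ... Output omitted ...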

Module Completion

Nice job!
